Earth is a unique planet in the solar system, holding the key to sustaining life. Even now, many parts of the oceans and forests remain unexplored, and new species are discovered periodically in remote forests across the globe. Further genetic studies reveal that humans share some DNA characteristics with other mammals. Exploring such remote regions can help answer many open questions about the origin of life, evolution, and life in unpolluted environments.
However, such explorations are costly and even dangerous for humans. Aerial surveillance has improved considerably over the last few decades. These technological improvements can assist human exploration by capturing images of unexplored regions, reducing cost and human effort. But most forests are dense with vegetation, so aerial surveillance can provide only the top view. We can therefore use the surveillance data (tabular data) to make predictions and restrict the search space for human explorations. Findings from these explorations can then be extrapolated to missing species in similar areas, reducing the search space further.
This project aims to use tabular surveillance data together with cartographic information (land structures) to train models that predict the forest cover type. The target is a classification model that classifies the forest cover type with more than 90% accuracy.
Model performance is evaluated by the prediction accuracy on held-out data. All classification models are trained on 80% of the data set, and the remaining 20%, which is never used for training, is used for testing. Training on this large amount of data should help the models predict accurately on unseen data.
The classifiers can then be ranked by performance, and the highest-ranked classifier chosen using all of the following properties,
The data used in this project comes from the UCI Machine Learning Repository and can be downloaded from the following URL: https://archive.ics.uci.edu/ml/datasets/Covertype
In [1]:
import os
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Loading the data from the local folder
os.chdir('data/')
In [ ]:
# Structure of raw tabular data
"""
Name                                 Data Type     Measurement                  Description
Elevation                            quantitative  meters                       Elevation in meters
Aspect                               quantitative  azimuth                      Aspect in degrees azimuth
Slope                                quantitative  degrees                      Slope in degrees
Horizontal_Distance_To_Hydrology     quantitative  meters                       Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology       quantitative  meters                       Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways      quantitative  meters                       Horz Dist to nearest roadway
Hillshade_9am                        quantitative  0 to 255 index               Hillshade index at 9am, summer solstice
Hillshade_Noon                       quantitative  0 to 255 index               Hillshade index at noon, summer solstice
Hillshade_3pm                        quantitative  0 to 255 index               Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points   quantitative  meters                       Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)   qualitative   0 (absence) or 1 (presence)  Wilderness area designation
Soil_Type (40 binary columns)        qualitative   0 (absence) or 1 (presence)  Soil Type designation
Cover_Type (7 types)                 integer       1 to 7                       Forest Cover Type designation
"""
In [3]:
# Reading the data from the input file
data = pd.read_csv('covtype.data', sep=",", header = None)
# Assigning column names to the dataframe data
data.columns = ['elevation', 'aspect', 'slope', 'horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology',
'horizontal_distance_to_road_ways', 'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
'horizontal_distance_to_fire_points', 'wilderness_area_1', 'wilderness_area_2',
'wilderness_area_3', 'wilderness_area_4',
'soil_type_1', 'soil_type_2', 'soil_type_3', 'soil_type_4', 'soil_type_5',
'soil_type_6', 'soil_type_7', 'soil_type_8', 'soil_type_9', 'soil_type_10',
'soil_type_11', 'soil_type_12', 'soil_type_13', 'soil_type_14', 'soil_type_15',
'soil_type_16', 'soil_type_17', 'soil_type_18', 'soil_type_19', 'soil_type_20',
'soil_type_21', 'soil_type_22', 'soil_type_23', 'soil_type_24', 'soil_type_25',
'soil_type_26', 'soil_type_27', 'soil_type_28', 'soil_type_29', 'soil_type_30',
'soil_type_31', 'soil_type_32', 'soil_type_33', 'soil_type_34', 'soil_type_35',
'soil_type_36', 'soil_type_37', 'soil_type_38', 'soil_type_39', 'soil_type_40',
'cover_type']
# Extracting all the features from the dataframe
feature_cols = list(data.columns[:-1])
# Extracting the target attribute from the dataframe
target_col = data.columns[-1]
# Assigning the features into X_all variable
X_all = data[feature_cols]
# Assigning the target attribute into y_all variable
y_all = data[target_col]
# print "\nFeature values:-"
# Printing the first row of X_all and y_all
print(X_all.head(1))
print(y_all.head(1))
In [5]:
# Information about the features data
X_all.info()  # info() prints its summary itself and returns None
In [6]:
# Describing about target variable
display(y_all.describe())
In [7]:
# Describing the first five feature columns
display(X_all.iloc[:, :5].describe())
# Describing the sixth through tenth feature columns
display(X_all.iloc[:, 5:10].describe())
# NOTE: wilderness_area and soil_type are binary valued columns, so they are not described here
In [10]:
# Counting the occurrences of each target class
print(data.groupby('cover_type').size().sort_values(ascending=False))
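If the class counts turn out to be uneven, it helps to know what accuracy a trivial model would reach. The following is a minimal sketch on a small hypothetical label list (not the real `cover_type` column); the helper name `majority_baseline` is an assumption made for illustration. It computes the accuracy of always predicting the most frequent class, which any trained classifier should beat.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / float(len(labels))

# Hypothetical labels for illustration only (not taken from covtype.data)
toy_labels = [2, 2, 2, 2, 1, 1, 1, 3, 7, 2]
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print("Always predicting class %d gives accuracy %.2f" % (baseline_label, baseline_accuracy))
```

On the real data the same function could be called as `majority_baseline(y_all)`.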
In [8]:
# Reference https://en.wikipedia.org/wiki/Outlier
cols = data.columns # Reads the column names from the dataframe
# A value is flagged when it lies more than 3 standard deviations from its column mean
temp_calc = data[cols].apply(lambda x: np.abs(x-x.mean())/x.std() > 3.0)
# Rows in which more than 6 columns are not flagged are kept as non-outliers
not_outliers = temp_calc[temp_calc.apply(pd.Series.value_counts, axis=1)[0] > 6.0].index.tolist()
print(" * Number of outliers (deviants in more columns) = %d" % (len(data)-len(not_outliers)))
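The three-standard-deviation rule can also be written by counting, for each row, the columns in which the value is extreme. The sketch below is a variation on the cell above, not the same computation: the function name `rows_with_outliers`, the `min_columns` parameter, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def rows_with_outliers(frame, z_limit=3.0, min_columns=1):
    """Return index labels of rows that are extreme in at least `min_columns` columns."""
    z_scores = (frame - frame.mean()) / frame.std()
    flagged = z_scores.abs() > z_limit      # True where a value is more than z_limit SDs away
    n_flagged = flagged.sum(axis=1)         # number of extreme columns per row
    return n_flagged[n_flagged >= min_columns].index.tolist()

# Hypothetical toy data: 20 rows with one extreme value in column 'a'
toy = pd.DataFrame({'a': list(range(19)) + [1000],
                    'b': list(range(20))})
print(rows_with_outliers(toy))
```

Raising `min_columns` makes the rule stricter, so that only rows deviating in several columns at once are reported.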
In [14]:
# Also checking for any attribute with standard deviation zero
# Result: no column name is printed, so no column has zero standard deviation.
# Every column therefore varies and can carry information for the classification algorithms
def check_for_zero_sd(temp_data, temp_cols):
    for t_col in temp_cols:
        c_sd = temp_data[t_col].std()
        if c_sd == 0:
            print(t_col)
check_for_zero_sd(data, data.columns)
Correlation among attributes helps in understanding their relationships, and these relationships can in turn help when training the classification algorithms. Only attributes with continuous values can be used for correlation, so the binary 'wilderness_area' and 'soil_type' attributes are excluded from the correlation calculation.
In [18]:
correlated_attributes = [] # a list to store the correlated attributes
column_index_limit = 10 # the first 10 columns hold the continuous attributes
temp_data = data.iloc[:, :column_index_limit] # extracting the continuous data from the dataframe
cols = temp_data.columns # extracting all the column names
data_correlated_values = temp_data.corr() # pairwise correlation of the extracted columns
for i in range(0, column_index_limit): # iterating over the rows
    for j in range(i + 1, column_index_limit): # iterating over the columns above the diagonal
        # checks for positive correlation scores that are >= 0.5 and < 1.0
        condition_1 = (data_correlated_values.iloc[i,j] < 1.0) and (data_correlated_values.iloc[i,j] >= 0.50)
        # checks for negative correlation scores that are <= -0.5
        condition_2 = (data_correlated_values.iloc[i,j] <= -0.50)
        # both signs are considered: a positive score indicates a direct correlation,
        # while a negative score indicates an inverse correlation
        if condition_1 or condition_2:
            # if the condition is true add it to the list
            correlated_attributes.append([data_correlated_values.iloc[i,j], i, j])
for val, i, j in sorted(correlated_attributes, key=lambda x: -abs(x[0])):
    # iterate over the list to print the columns and the score
    print ("%s and %s = %.2f" % (cols[i], cols[j], val))
# the printed pairs are worth considering, especially when constructing classifiers
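The pair search above can also be sketched more compactly by walking only the upper triangle of the correlation matrix. This is an illustrative variant on hypothetical toy data; the function name `strongly_correlated_pairs` is an assumption, and unlike the cell above it does not exclude scores of exactly 1.0.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.5):
    """List (col_a, col_b, r) for column pairs with |r| >= threshold, strongest first."""
    corr = frame.corr()
    rows, cols = np.triu_indices(len(corr), k=1)   # index pairs above the diagonal
    pairs = [(corr.index[i], corr.columns[j], corr.iloc[i, j])
             for i, j in zip(rows, cols)
             if abs(corr.iloc[i, j]) >= threshold]
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical toy columns: a, b and c move together, d alternates in sign
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [2, 4, 6, 8, 10, 13],
                    'c': [6, 5, 4, 3, 2, 0],
                    'd': [1, -1, 1, -1, 1, -1]})
for col_a, col_b, r in strongly_correlated_pairs(toy):
    print("%s and %s = %.2f" % (col_a, col_b, r))
```

On the project data it could be applied to the ten continuous columns, for example `strongly_correlated_pairs(data.iloc[:, :10])`.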
In [14]:
# Listing the count of the different categories of the target variable
_= sns.distplot(data.cover_type)
In [26]:
# Leaving out the binary variables wilderness_area and soil_type, since they provide little information
# about the distribution of the data compared to the other variables
X_new = data[['elevation', 'aspect', 'slope', 'horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology',
              'horizontal_distance_to_road_ways', 'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
              'horizontal_distance_to_fire_points', 'cover_type']]
_ = pd.plotting.scatter_matrix(X_new.sample(n=10000), alpha = 0.3, figsize = (10,20), diagonal = 'kde')
Examining the scatter plot above, many interesting relationships can be observed between the continuous attributes. However, the figure is not big enough to show all relationships clearly. The following figures show them in detail, one plot at a time.
These attributes have higher correlation(already calculated in section 2.1) with score of
In [43]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "aspect", y_vars = "hillshade_3pm")
plt.show()
In [44]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "aspect", y_vars = "hillshade_9am")
plt.show()
In [45]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "hillshade_9am", y_vars = "hillshade_3pm")
plt.show()
In [46]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "hillshade_noon", y_vars = "hillshade_3pm")
plt.show()
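The two plots above show an elliptic relation between the attributes 'hillshade_noon' and 'hillshade_3pm', and also between 'hillshade_9am' and 'hillshade_3pm'. This relationship is expected, since a hill's shade depends on sunlight. A likely explanation is that a hillshade index is derived from the terrain's slope and aspect together with the sun's position at that time of day: for a given slope, each index varies roughly like a sinusoid of the aspect, and two sinusoids with different phase offsets trace an ellipse when plotted against each other.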
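One way to see how an elliptic pattern between two hillshade attributes can arise from geometry alone is to evaluate the hillshade formula commonly used in GIS tools at two different sun positions. The sketch below is an assumption-laden illustration: the sun azimuths and altitudes, the fixed slope, and the helper names are hypothetical, and it is not known whether the data set was produced with exactly this formula. It checks that, for a fixed slope, the two normalised indices satisfy the equation of an ellipse as the aspect sweeps through 360 degrees.

```python
import math

SLOPE = 20.0                 # hypothetical fixed slope in degrees
MORNING = (100.0, 45.0)      # hypothetical sun (azimuth, altitude) in degrees
AFTERNOON = (260.0, 45.0)    # hypothetical sun (azimuth, altitude) in degrees

def hillshade(aspect_deg, slope_deg, sun_azimuth_deg, sun_altitude_deg):
    """Common GIS hillshade formula scaled to 0..255 (negative values are not clipped here)."""
    zenith = math.radians(90.0 - sun_altitude_deg)
    slope = math.radians(slope_deg)
    delta = math.radians(sun_azimuth_deg - aspect_deg)
    return 255.0 * (math.cos(zenith) * math.cos(slope)
                    + math.sin(zenith) * math.sin(slope) * math.cos(delta))

def normalised(aspect_deg, sun):
    """Remove the constant offset and amplitude, leaving cos(azimuth - aspect)."""
    azimuth, altitude = sun
    zenith = math.radians(90.0 - altitude)
    offset = 255.0 * math.cos(zenith) * math.cos(math.radians(SLOPE))
    amplitude = 255.0 * math.sin(zenith) * math.sin(math.radians(SLOPE))
    return (hillshade(aspect_deg, SLOPE, azimuth, altitude) - offset) / amplitude

phase = math.radians(MORNING[0] - AFTERNOON[0])
max_residual = 0.0
for aspect in range(0, 360, 5):
    x = normalised(aspect, MORNING)
    y = normalised(aspect, AFTERNOON)
    # Two cosines with phase difference `phase` satisfy x^2 + y^2 - 2xy cos(phase) = sin^2(phase)
    residual = abs(x * x + y * y - 2.0 * x * y * math.cos(phase) - math.sin(phase) ** 2)
    max_residual = max(max_residual, residual)
print("largest deviation from the ellipse equation: %.2e" % max_residual)
```

In real terrain the slope varies from point to point, so one would expect a filled elliptic cloud rather than a single clean curve.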
Question 34 on page 58 of the book 'A Mathematical Nature Walk' discusses ways to estimate the height of a tree from its "elliptic" shadow. The mathematical
In [47]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "horizontal_distance_to_hydrology", y_vars = "vertical_distance_to_hydrology")
plt.show()
The plot above depicts a strong linear pattern between the attributes 'vertical_distance_to_hydrology' and 'horizontal_distance_to_hydrology'.
In [48]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "slope", y_vars = "hillshade_noon")
plt.show()
In [50]:
sns.pairplot(data, hue="cover_type", size=5, x_vars = "elevation", y_vars = "horizontal_distance_to_road_ways")
plt.show()
In [64]:
sns.distplot(data['elevation'].sample(n=500000), hist=False, rug=True);
plt.show()
Attribute 'elevation' looks like a normal distribution with one main peak and a second, local peak.
In [8]:
sns.distplot(data['aspect'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'aspect' appears to contain at least two normal distributions.
In [54]:
sns.distplot(data['slope'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'slope' looks like a normal distribution.
In [55]:
sns.distplot(data['horizontal_distance_to_hydrology'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'horizontal_distance_to_hydrology' looks like a normal distribution.
In [56]:
sns.distplot(data['vertical_distance_to_hydrology'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'vertical_distance_to_hydrology' looks like a normal distribution. Interestingly, the distribution peaks around the value 0.
In [57]:
sns.distplot(data['horizontal_distance_to_road_ways'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'horizontal_distance_to_road_ways' looks like a normal distribution.
In [58]:
sns.distplot(data['hillshade_9am'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'hillshade_9am' looks like a left-skewed normal distribution.
In [59]:
sns.distplot(data['hillshade_noon'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'hillshade_noon' looks like a left-skewed normal distribution.
In [60]:
sns.distplot(data['hillshade_3pm'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'hillshade_3pm' looks like a normal distribution.
In [61]:
sns.distplot(data['horizontal_distance_to_fire_points'].sample(n=10000), hist=False, rug=True);
plt.show()
Attribute 'horizontal_distance_to_fire_points' looks like a normal distribution.
In [63]:
sns.distplot(data['cover_type'].sample(n=100000), hist=False, rug=True);
plt.show()
The 'cover_type' attribute plotted above is the target variable, so it is not considered in the discussion of feature distributions. The plot still shows how the results are distributed over the classes 1 to 7.
The target variable takes values from 1 to 7, which implies that our classifier must be a multiclass classifier. Some well known algorithms are listed below. We run those algorithms against the data, pick the best performing ones, and then fine tune them for further analysis and improved accuracy.
https://en.wikipedia.org/wiki/Multiclass_classification
We can evaluate the performance of the above classifiers by measuring their precision, recall, and accuracy scores. Precision is the fraction of true positives among all predicted positives (true positives plus false positives). Recall is the fraction of true positives among all actual positives (true positives plus false negatives). Accuracy is the fraction of correct predictions (true positives plus true negatives) among all predictions (true positives, true negatives, false positives, and false negatives).
Finally, we prefer using the classifier which has the highest scores, for any future forest cover type predictions.
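The scores described above can be sketched by hand for a multiclass problem, treating each class in turn as the "positive" class. The labels below are hypothetical and the function names `per_class_scores` and `accuracy` are assumptions for illustration; in the notebook itself these numbers come from `metrics.classification_report` and `metrics.accuracy_score`.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall)} with each label treated as the positive class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))

# Hypothetical true and predicted cover types
y_true_toy = [1, 1, 1, 2, 2, 3]
y_pred_toy = [1, 1, 2, 2, 2, 3]
scores = per_class_scores(y_true_toy, y_pred_toy)
acc = accuracy(y_true_toy, y_pred_toy)
for label, (precision, recall) in sorted(scores.items()):
    print("class %d: precision %.2f, recall %.2f" % (label, precision, recall))
print("accuracy %.2f" % acc)
```

For a multiclass problem the overall accuracy reduces to the number of correct predictions divided by the number of samples.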
In [5]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
from sklearn.model_selection import cross_val_score
from sklearn import model_selection
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import svm
import xgboost as xgb
import time
In [6]:
# Preparing the training and the test data
def shuffle_data(X, y):
    '''
    This method divides the data into a training set and a test set.
    The training set consists of 80% of the total data.
    The test set consists of 20% of the total data.
    '''
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size = 0.8, random_state=42)
    return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = shuffle_data(X_all, y_all)
print("Training set: {} samples".format(X_train.shape[0]))
print("Test set: {} samples".format(X_test.shape[0]))
# Scaling all the attributes
scaler = StandardScaler()
scaler.fit(X_train)
X_train_tfm = scaler.transform(X_train)
X_test_tfm = scaler.transform(X_test)
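The scaling step above fits the scaler on the training set only and then applies the same transformation to the test set. A minimal NumPy sketch of that idea, on hypothetical toy arrays and with an assumed helper name `standardize`, looks like this:

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the mean and standard deviation of the training array."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0          # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

# Hypothetical toy data: three training rows and one test row with two features
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_toy = np.array([[2.0, 40.0]])
train_scaled, test_scaled = standardize(train_toy, test_toy)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Using the training statistics for the test set keeps information about the test data out of the preprocessing step.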
In [ ]:
clf_list = [LinearSVC(),
            SGDClassifier(n_jobs=-1),
            KNeighborsClassifier(n_jobs=-1),
            RandomForestClassifier(n_jobs=-1),
            xgb.XGBClassifier(nthread=-1),
            svm.SVC(kernel='sigmoid'), # ran for a long time (hours), so it was not fine tuned
            svm.SVC(kernel='rbf')]     # ran for a long time (hours), so it was not fine tuned
# the following code calculates the mean accuracy score of each chosen classifier
for clf in clf_list:
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
The mean accuracy of each classifier, calculated from its predictions, is listed above. From that list, three classifiers are selected for further fine tuning: KNeighborsClassifier (mean accuracy ~ 0.966997409705) and RandomForestClassifier (mean accuracy ~ 0.942712322401), which already score very high, and XGBClassifier (mean accuracy ~ 0.744025541509).
The k-nearest neighbors classifier finds the n_neighbors training points nearest to a data point and assigns the most common class among those neighbors to the data point. In the following classifier, the ten nearest neighbors are considered, and the distance between a neighbor and the data point is measured using the Minkowski distance (with p=2 this is the Euclidean distance). The classifier is configured to use all available cores (n_jobs=-1).
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
http://scikit-learn.org/stable/modules/neighbors.html#classification
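The voting idea behind k-nearest neighbors can be sketched in a few lines of plain Python. This is a brute-force illustration on hypothetical two-dimensional points, not how scikit-learn implements the search; the function name `knn_predict` is an assumption.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training points belonging to two classes
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [1, 1, 1, 2, 2, 2]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

The distance used here is the Euclidean distance, which corresponds to the Minkowski distance with p=2.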
In [9]:
clf_knn = KNeighborsClassifier(n_neighbors=10,
weights='uniform',
algorithm='auto',
leaf_size=10000,
p=2,
metric='minkowski',
metric_params=None,
n_jobs=-1)
# training the classifier
start = time.time()
clf_knn.fit(X_train, y_train)
end = time.time()
time_knn_train = end - start
# predicting the X_test values
start = time.time()
y_predicted = clf_knn.predict(X_test)
end = time.time()
time_knn_test = end - start
# evaluating the predictions with a classification report and the accuracy score
print("Results for KNeighborsClassifier - on unscaled data:")
print(metrics.classification_report(y_test, y_predicted))
print(metrics.accuracy_score(y_test, y_predicted))
# Accuracy results on scaled values
# Fitting
start = time.time()
clf_knn.fit(X_train_tfm ,y_train)
end = time.time()
time_knn_tfm_train = end - start
# predicting
start = time.time()
y_predicted_knn = clf_knn.predict(X_test_tfm)
end = time.time()
time_knn_tfm_test = end - start
# Classification report and accuracy score
print("Results for KNeighborsClassifier - on scaled data:")
print(metrics.classification_report(y_test, y_predicted_knn))
print(metrics.accuracy_score(y_test, y_predicted_knn))
In [10]:
# Running Time
print("Running time in seconds")
print("For unscaled data,")
print("train data : {}".format(float(time_knn_train)))
print("test data : {}".format(float(time_knn_test)))
print("For scaled data,")
print("train data : {}".format(float(time_knn_tfm_train)))
print("test data : {}".format(float(time_knn_tfm_test)))
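The start/end timing pattern is repeated for every fit and predict call in this notebook. A small helper could wrap it; the sketch below is one possible way to do so, and the name `timed` is an assumption rather than part of the original code.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start

# Stand-in workload for illustration; any callable can be timed the same way
total, seconds = timed(sum, range(1000))
print("result %d computed in %.6f seconds" % (total, seconds))
```

With this helper, a training call could be written as `_, time_knn_train = timed(clf_knn.fit, X_train, y_train)` and a prediction as `y_predicted, time_knn_test = timed(clf_knn.predict, X_test)`.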
Random Forests belong to a class of algorithms derived from randomized decision trees.
"In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model."
The RandomForestClassifier is also configured to use all cores in the hardware (n_jobs=-1).
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
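The two sources of randomness mentioned in the quotation, bootstrap samples and random feature subsets, can be sketched with the standard library alone. The helper names are hypothetical, and the square-root rule for the subset size is an assumption about a common default for classification, not something read from the fitted model.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices to consider at one split."""
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))

rng = random.Random(42)
indices = bootstrap_indices(10000, rng)
unique_fraction = len(set(indices)) / float(len(indices))
features = random_feature_subset(54, rng)   # the data set has 54 feature columns
print("fraction of distinct rows in the bootstrap sample: %.3f" % unique_fraction)
print("features considered at one split:", features)
```

Because sampling is done with replacement, each tree sees only part of the distinct training rows, which is what makes the individual trees differ from each other.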
In [11]:
clf_rfc = RandomForestClassifier(n_estimators=500, # Increased the values -
criterion='gini',
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features='auto',
max_leaf_nodes=None,
min_impurity_split=1e-07,
bootstrap=True,
oob_score=False,
n_jobs=-1,
random_state=42,
verbose=0,
warm_start=False,
class_weight=None)
# training the classifier
start = time.time()
clf_rfc.fit(X_train, y_train)
end = time.time()
time_rfc_train = end - start
# testing the classifier and printing its mean accuracy
# print clf_rfc.score(X_test, y_test)
start = time.time()
y_predicted_rfc = clf_rfc.predict(X_test)
end = time.time()
time_rfc_test = end - start
# testing the classifier using the cross validation score
# print (cross_val_score(clf_rfc, X_test, y_test))
print("Results for RandomForestClassifier - on unscaled data:")
print(metrics.classification_report(y_test, y_predicted_rfc))
print(metrics.accuracy_score(y_test, y_predicted_rfc))
# Accuracy results on scaled values
# Fitting
start = time.time()
clf_rfc.fit(X_train_tfm ,y_train)
end = time.time()
time_rfc_tfm_train = end - start
start = time.time()
y_predicted_rfc_tfm = clf_rfc.predict(X_test_tfm)
end = time.time()
time_rfc_tfm_test = end - start
# Classification report and accuracy score
print("Results for RandomForestClassifier - on scaled data:")
print(metrics.classification_report(y_test, y_predicted_rfc_tfm))
print(metrics.accuracy_score(y_test, y_predicted_rfc_tfm))
In [8]:
# Running Time in seconds
print("Running time for RandomForestClassifier:")
print("For unscaled data,")
print("train data : {}".format(float(time_rfc_train)))
print("test data : {}".format(float(time_rfc_test)))
print("For scaled data,")
print("train data : {}".format(float(time_rfc_tfm_train)))
print("test data : {}".format(float(time_rfc_tfm_test)))
XGBClassifier is the scikit-learn style classifier of XGBoost, an implementation of gradient boosting. It is installed as a separate Python package (see reference). This classifier is built upon the concept of 'Gradient Boosting'.
"Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function."
The XGBClassifier is also configured to use all cores in the hardware (nthread=-1).
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
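The stage-wise idea in the quotation can be illustrated with a deliberately small example. The sketch below boosts one-split "stumps" on a single feature for a regression problem with squared loss; it is a teaching sketch under those assumptions, with hypothetical toy data and helper names, and it is not how XGBoost fits its multiclass softmax objective.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / float(len(left))
        right_mean = sum(right) / float(len(right))
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_stages=20, learning_rate=0.5):
    """Add one stump per stage, each fitted to the current residuals."""
    base = sum(y) / float(len(y))
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For squared loss the negative gradient is simply the residual y - prediction
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [pi + learning_rate * (left_mean if xi <= threshold else right_mean)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, predictions

def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / float(len(y))

# Hypothetical toy data: a step-shaped target over one feature
x_toy = list(range(10))
y_toy = [1, 1, 1, 2, 2, 2, 5, 5, 5, 5]
base, stumps, boosted = boost(x_toy, y_toy)
mse_before = mean_squared_error(y_toy, [base] * len(y_toy))
mse_after = mean_squared_error(y_toy, boosted)
print("MSE of the constant model: %.3f, after boosting: %.3f" % (mse_before, mse_after))
```

Each stage corrects part of the error left by the previous stages, which is why the training error of the ensemble shrinks as stumps are added.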
In [12]:
# Train and predict using the XGBoost algorithm
clf_xgbc = xgb.XGBClassifier(max_depth=50,
n_estimators=500,
silent=True,
objective="multi:softmax",
nthread=-1,
gamma=0,
min_child_weight=1,
max_delta_step=0,
subsample=1,
colsample_bytree=0.3,
colsample_bylevel=1,
reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
seed=0,
base_score=0.5,
missing=None)
# Fitting
start = time.time()
clf_xgbc.fit(X_train ,y_train)
end = time.time()
time_xgbc_train = end - start
start = time.time()
y_predicted_xgbc = clf_xgbc.predict(X_test)
end = time.time()
time_xgbc_test = end - start
# Classification report and accuracy score
print("Results for XGBClassifier - on unscaled data:")
print(metrics.classification_report(y_test, y_predicted_xgbc))
print(metrics.accuracy_score(y_test, y_predicted_xgbc))
# Fitting
start = time.time()
clf_xgbc.fit(X_train_tfm ,y_train)
end = time.time()
time_xgbc_tfm_train = end - start
start = time.time()
y_predicted_xgbc_tfm = clf_xgbc.predict(X_test_tfm)
end = time.time()
time_xgbc_tfm_test = end - start
# Classification report and accuracy score
print("Results for XGBClassifier - on scaled data:")
print(metrics.classification_report(y_test, y_predicted_xgbc_tfm))
print(metrics.accuracy_score(y_test, y_predicted_xgbc_tfm))
In [13]:
# Running Time in minutes
print("Running time for XGBClassifier:")
print("For unscaled data,")
print("train data : {}".format(float(time_xgbc_train)/60.0))
print("test data : {}".format(float(time_xgbc_test)/60.0))
print("For scaled data,")
print("train data : {}".format(float(time_xgbc_tfm_train)/60.0))
print("test data : {}".format(float(time_xgbc_tfm_test)/60.0))
In [16]:
# Plotting the running time of all three classifiers
time_plot_data = [
{"classifier":"u_KNN", "time":time_knn_test, "type":"test"},
{"classifier":"u_KNN", "time":time_knn_train, "type":"train"},
{"classifier":"s_KNN", "time":time_knn_tfm_test, "type":"test"},
{"classifier":"s_KNN", "time":time_knn_tfm_train, "type":"train"},
{"classifier":"u_RFC", "time":time_rfc_test, "type":"test"},
{"classifier":"u_RFC", "time":time_rfc_train, "type":"train"},
{"classifier":"s_RFC", "time":time_rfc_tfm_test, "type":"test"},
{"classifier":"s_RFC", "time":time_rfc_tfm_train, "type":"train"},
{"classifier":"u_XGBC", "time":time_xgbc_test, "type":"test"},
{"classifier":"u_XGBC", "time":time_xgbc_train, "type":"train"},
{"classifier":"s_XGBC", "time":time_xgbc_tfm_test, "type":"test"},
{"classifier":"s_XGBC", "time":time_xgbc_tfm_train, "type":"train"}
]
sns.set_style("whitegrid")
sns_time_plot_data = pd.DataFrame(time_plot_data)
ax = sns.barplot(x="classifier", y="time", hue="type", data=sns_time_plot_data)
plt.show()
Among the three classifiers above, XGBClassifier improves by a large margin after fine tuning: its accuracy rises from ~0.74 to ~0.94. The initial goal of an accuracy above 90% is therefore achieved. However, RandomForestClassifier performs consistently better than all the other classifiers. It produces an accuracy score of ~0.96, the highest observed in this project, and it is also the fastest classifier considered in this project.
So, from the list of classifiers used in this project, I would recommend RandomForestClassifier for further analysis in similar forest cover identification tasks, since it consistently has the highest accuracy score.